{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "28af9f4c",
   "metadata": {},
   "source": [
    "# Weaviate Import\n",
    "\n",
    "This notebook is used to populate the `WeaviateBlogChunk` class.\n",
    "\n",
    "You can connect to Weaviate through local host, or create a free 14-day sandbox on [WCS](https://console.weaviate.cloud/)!\n",
    "\n",
    "1. (Option 1) Create a cluster on WCS and grab your cluster URL and auth key (if enabled)\n",
    "\n",
    "1. (Option 2) Run `docker-compose up -d` with the docker script in the file to start Weaviate locally on localhost:8080\n",
    "\n",
    "\n",
    "2. Make sure the `/blog` folder is in this directory (these are parsed from github.com/weaviate/weaviate-io -- feel free to drag and drop that folder in here to update the content).\n",
    "\n",
    "\n",
    "3. Run this notebook and the 1182 blog chunks will be loaded into Weaviate."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db1c926e",
   "metadata": {},
   "source": [
    "## Connect to Client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "cf69ba40",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n",
      "I0000 00:00:1721914628.345996 3818947 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache\n"
     ]
    }
   ],
   "source": [
    "# Import Weaviate and Connect to Client\n",
    "import weaviate\n",
    "\n",
    "client = weaviate.connect_to_local()  # Connect to local host\n",
    "# client = weaviate.connect_to_wcs(\n",
    "#     cluster_url=\"WCS-url\",  # Replace with your WCS URL\n",
    "#     auth_credentials=weaviate.auth.AuthApiKey(\"auth-key\"),  # Replace with your WCS key\n",
    "#     headers={\n",
    "#         'X-Cohere-Api-Key': (\"API-Key\") # Replace with your Cohere API key\n",
    "#     }\n",
    "# )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19bb10a5",
   "metadata": {},
   "source": [
    "## Create Schema"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "8b209831",
   "metadata": {},
   "outputs": [],
   "source": [
    "# CAUTION: Running this will delete the collection along with the objects\n",
    "\n",
    "# client.collections.delete_all()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f3643f23",
   "metadata": {},
   "outputs": [],
   "source": [
    "import weaviate.classes.config as wvcc\n",
    "\n",
    "collection = client.collections.create(\n",
    "    name=\"WeaviateBlogChunk\",\n",
    "    vectorizer_config=wvcc.Configure.Vectorizer.text2vec_cohere\n",
    "    (\n",
    "        model=\"embed-multilingual-v3.0\"\n",
    "    ),\n",
    "    properties=[\n",
    "            wvcc.Property(name=\"content\", data_type=wvcc.DataType.TEXT),\n",
    "            wvcc.Property(name=\"author\", data_type=wvcc.DataType.TEXT),\n",
    "      ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19336940",
   "metadata": {},
   "source": [
    "## Chunk Blogs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "9a6788d9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import re\n",
    "\n",
    "def chunk_list(lst, chunk_size):\n",
    "    \"\"\"Break a list into chunks of the specified size.\"\"\"\n",
    "    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]\n",
    "\n",
    "def split_into_sentences(text):\n",
    "    \"\"\"Split text into sentences using regular expressions.\"\"\"\n",
    "    sentences = re.split(r'(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?)\\s', text)\n",
    "    return [sentence.strip() for sentence in sentences if sentence.strip()]\n",
    "\n",
    "def read_and_chunk_index_files(main_folder_path):\n",
    "    \"\"\"Read index.md files from subfolders, split into sentences, and chunk every 5 sentences.\"\"\"\n",
    "    blog_chunks = []\n",
    "    for folder_name in os.listdir(main_folder_path):\n",
    "        subfolder_path = os.path.join(main_folder_path, folder_name)\n",
    "        if os.path.isdir(subfolder_path):\n",
    "            index_file_path = os.path.join(subfolder_path, 'index.mdx')\n",
    "            if os.path.isfile(index_file_path):\n",
    "                with open(index_file_path, 'r', encoding='utf-8') as file:\n",
    "                    content = file.read()\n",
    "                    sentences = split_into_sentences(content)\n",
    "                    sentence_chunks = chunk_list(sentences, 5)\n",
    "                    sentence_chunks = [' '.join(chunk) for chunk in sentence_chunks]\n",
    "                    blog_chunks.extend(sentence_chunks)\n",
    "    return blog_chunks\n",
    "\n",
    "# Example usage\n",
    "main_folder_path = './examples/weaviate_setup/blog'\n",
    "blog_chunks = read_and_chunk_index_files(main_folder_path)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ed58c948",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1643"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(blog_chunks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "1ea97830",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"---\\ntitle: 'Accelerating Vector Search up to +40% with Intel’s latest Xeon CPU - Emerald Rapids'\\nslug: intel\\nauthors: [zain, asdine, john]\\ndate: 2024-03-26\\nimage: ./img/hero.png\\ntags: ['engineering', 'research']\\ndescription: 'Boosting Weaviate using SIMD-AVX512, Loop Unrolling and Compiler Optimizations'\\n---\\n\\n![HERO image](./img/hero.png)\\n\\n**Overview of Key Sections:**\\n- [**Vector Distance Calculations**](#vector-distance-calculations) Different vector distance metrics popularly used in Weaviate. - [**Implementations of Distance Calculations in Weaviate**](#vector-distance-implementations) Improvements under the hood for implementation of Dot product and L2 distance metrics. - [**Intel’s 5th Gen Intel Xeon Processor, Emerald Rapids**](#enter-intel-emerald-rapids)  More on Intel's new 5th Gen Xeon processor. - [**Benchmarking Performance**](#lets-talk-numbers) Performance numbers on microbenchmarks along with simulated real-world usage scenarios. What’s the most important calculation a vector database needs to do over and over again?\""
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "blog_chunks[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b9261a1",
   "metadata": {},
   "source": [
    "## Import Objects"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "873567be",
   "metadata": {},
   "outputs": [],
   "source": [
    "from weaviate.util import get_valid_uuid\n",
    "from uuid import uuid4\n",
    "\n",
    "blogs = client.collections.get(\"WeaviateBlogChunk\")\n",
    "\n",
    "for idx, blog_chunk in enumerate(blog_chunks):\n",
    "    upload = blogs.data.insert(\n",
    "        properties={\n",
    "            \"content\": blog_chunk\n",
    "        }\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72f31202",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
